Week 4 Computer Lab Worksheet



Generalised linear models

Author

Chris Moreh

Aims

In this lab you will practice fitting various types of generalised linear models. These models generalise linear regression to situations where the outcome (dependent) variable is not drawn from a normal (Gaussian) distribution. We will look at examples of logit/probit and Poisson models. Continuing the idea of causal estimation from week 3, we start with data underpinning the article by Ladd and Lenz (2009) and will attempt to replicate a simpler version of their probit model (see their Table 1A). We then fit a Poisson model with data from Weiss et al. (2021), attempting to replicate their Table S6.

By the end of the session, you will:

  • learn how to fit logit, probit and Poisson models
  • gain experience summarising, visualising and interpreting results from these models
  • practice further data manipulation techniques

Setup

In Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.

You can create a new .R script and .qmd/.Rmd for this week’s work (e.g. “Lab_4”).

We can also load the packages that we will be using in this lab. As the number of libraries needed for completing more advanced analyses increases, it’s worth using some method that avoids repetition, such as the p_load() function from pacman:

## First, install {pacman} if not yet installed:
if (!require("pacman")) install.packages("pacman")

## Then use {pacman} to attach packages and if they are not yet installed, install them:
pacman::p_load(tidyverse, sjPlot, sjlabelled, sjmisc,
               modelsummary, summarytools, gtsummary,
               marginaleffects)
If install.packages() throws up an error in Rmd/qmd:

If you are working in Rmarkdown/Quarto and you get an error message that includes the description

...
Error in contrib.url(repos, "source") :
  trying to use CRAN without setting a mirror
...

you can specify the CRAN mirror to be used by adding

options(repos = list(CRAN = "http://cran.rstudio.com/"))

before the call to install.packages()

Exercise 1: Estimating the persuasive power of the news media

In this exercise we will focus on data from an article by Ladd and Lenz (2009) (follow the doi link in the citation to access the article; you can also download it here). In the article the authors aim to estimate whether (print) media has the power to persuade people to vote differently. In general terms this is a very interesting question, even though the power of print media has weakened considerably over the past few decades and measuring the effect of alternative media sources would present additional challenges. Nevertheless, the authors take advantage of a unique natural experiment that arguably facilitates tackling this causal question: between the 1992 and 1997 UK general elections, four newspapers (the Sun, Daily Star, Independent, and Financial Times) changed their editorial stance and tone from supporting the Conservative Party to supporting Tony Blair’s Labour Party. Such radical shifts in editorial stance are extremely rare. The data come from several waves of the British Election Panel Study between 1992 and 1997 and include variables on voting behaviour in 1992 and 1997, as well as on whether the respondent was a reader of one of the “switching” newspapers. These are the main variables needed for tackling the causal effect involved in the research question, but there are several other variables that provide various controls.

Data exploration

We can load a version of the dataset from the course website. The dataset is in Stata format (.dta), so we’ll use one of the import functions we’ve used before:

# Let's call the data object "news" for simplicity

news <- sjlabelled::read_stata("https://cgmoreh.github.io/HSS8005-data/LaddLenz.dta")

Because it’s a Stata dataset, the variables are likely to have useful labels that can provide more information about them. We have already used the sjPlot::view_df() function, which is very useful for this:

view_df(news)

Have a look at the table to get a sense for the data. Table 1S in the Appendix to the article offers more details on the coding and meaning of the variables, which will be very useful later. You can download the Appendix here.

Questions

  • What does the hhincq92 (“Prior Income”) variable measure, and how are its levels coded?
  • What might the values coded as “9” in some of the variables mean?

To better understand the dataset and check whether any data management tasks are needed before the statistical analysis, let’s learn a few new functions that can come in handy. There are various packages that include functions to summarise data, and we will look at three further options: modelsummary, gtsummary and summarytools.

summarytools::dfSummary()

This is a nice quick function that produces more complex “codeplans” than sjPlot::view_df() for labelled data in a similar HTML format. The easiest way to use it is to combine it with summarytools::view() as part of the same pipeline to get a nicely formatted HTML table in the Viewer pane:

summarytools::dfSummary(news) |> summarytools::view()

gtsummary::tbl_summary()

This is another option that works well with labelled data and produces summaries of both continuous (numeric) and categorical variables in a publication-ready table style; it is also useful for quick descriptive checks, though it can be a little slow:

gtsummary::tbl_summary(news)

modelsummary::datasummary_skim()

This function doesn’t show variable labels and by default only summarises “numeric” variables; categorical variables can be tabulated separately by adding the argument type = "categorical". It works on our dataset because all the variables are currently recognised as numeric, but it’s not the most efficient for labelled data. We’ll look at it because the {modelsummary} package contains a number of other extremely useful functions that produce flexible tables, which we will use later:

modelsummary::datasummary_skim(news)

Questions

  • What else have we learnt about the data?
  • Are there any data transformation tasks necessary before we use the data for statistical modelling?

Data wrangling

It appears that the values coded as 9 across several variables are in fact “missing” values. Instead of excluding them from the statistical analysis by setting them as NA, we will keep them, just as the authors of the original study have done. But we’ll relabel the categorical (factor) variables before further analysis.
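For reference, if we did want to treat those 9 codes as missing, the set_na() function from {sjlabelled} can convert them to NA. The sketch below is illustrative only (we won’t apply it in this lab), using three of the variables in which the 9 code appears:

```r
## Not run in this lab: convert the 9 codes to NA instead of keeping them
news_na <- news |>
  sjlabelled::set_na(copemg92, hhincq92, ragect92, na = 9)
```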

There are some other strange inconsistencies that we should take care of. Let’s take them step by step in the code chunk below using some convenience functions from sjlabelled and sjmisc (note that everything could also be done with some extra coding using core tidyverse packages):

news <- news |> 
  
## wkclass should be an indicator dummy, but kept as numeric;
## `dicho` function by default creates a new variable with "_d" as suffix in the name;
## setting `suffix = ""` overwrites the original variable
  sjmisc::dicho(wkclass, dich.by = 0.5, as.num = TRUE, suffix = "") |> 
  
## categories (levels) can be labelled
  sjlabelled::set_labels(know_3, labels = c("low", 
                                           "medium", 
                                           "high")) |> 
  sjlabelled::set_labels(copemg92, labels = c("Mortgage Very Difficult" = 1, 
                                              "Mortgage a Bit Difficult" = 0.5, 
                                              "Not Really Difficult or No Mortgage" = 0,
                                              "Not answered" = 9)) |> 
  sjlabelled::set_labels(hedqul92, labels = c("Less than O level (or foreign qual)" = 0,
                                              "O Level or Equivalent" = .25,
                                              "A Level or Equivalent" = .5,
                                              "Some Higher Education" = .75,
                                              "College Degree" = 1)) |> 
  sjlabelled::set_labels(hhincq92, labels = c("5999 or Less", 
                                              "6000-11,999",
                                              "12,000-19,999",
                                              "20,000 or More",
                                              "Not answered")) |> 
  sjlabelled::set_labels(ragect92, labels = c("18-24", 
                                              "25-34",
                                              "35-44", 
                                              "45-54", 
                                              "55-59",
                                              "60-64", 
                                              "65+", 
                                              "Not Given")) |> 

## some variables should be categorical (factor)
  
  ### first, change to character type to overwrite the fractional numeric codes with the labels
  sjlabelled::as_character(know_3, hedqul92, hhincq92, ragect92) |> 
  
  ### then, change to factors; labels will show, levels are renumbered starting from 1
  sjlabelled::as_factor(know_3, hedqul92, hhincq92, ragect92) |> 

##### instead of the two commands above, changing to `sjlabelled::as_label()` has the same effect
##### note also that we did not set "copemg92" as categorical, simply because the authors treat it as numeric in their model, even if it doesn't make much sense, especially because of the "Not answered" category

## Finally, some changes will make it easier to read the model as fit by the authors:

  ### will change the label associated with the `tolabor` variable (the "treatment" variable)
  sjlabelled::var_labels(tolabor = "Treatment (read switching paper)")

We can check some frequency tabulations to see the outcome:

# Five frequency tables
news |> sjmisc::frq(know_3, copemg92, hedqul92, hhincq92, ragect92)

Or we can check a whole summary table of the data frame again, which now looks like this:

gtsummary::tbl_summary(news)
This is what your table should look like:
Characteristic N = 1,593
outcome: voted labor in 1997 718 (45%)
Treatment (read switching paper) 211 (13%)
Prior Conservative Identification 666 (42%)
Prior Labour Identification 506 (32%)
Prior Liberal Identification 241 (15%)
White 1,557 (98%)
Working-Class 955 (60%)
Parents Voted Labour 581 (36%)
Prior Ideological Moderation 0.69 (0.49, 0.83)
Prior Labour Vote 528 (33%)
Prior Conservative Vote 640 (40%)
Prior Liberal Vote 293 (18%)
Prior Labour Party Support 0.50 (0.25, 0.75)
Prior Conservative Party Support 0.50 (0.25, 0.75)
Prior Political Knowledge
high 744 (47%)
low 252 (16%)
medium 597 (37%)
Prior Television Viewer 446 (28%)
Prior Daily Newspaper Reader 1,111 (70%)
Prior Ideology 0.55 (0.41, 0.68)
Authoritarianism 0.56 (0.48, 0.64)
Prior Trade Union Member 378 (24%)
Prior Coping Mortgage
0 272 (17%)
0.5 511 (32%)
1 807 (51%)
9 3 (0.2%)
Prior Education
A Level or Equivalent 187 (12%)
College Degree 636 (40%)
Less than O level (or foreign qual) 168 (11%)
O Level or Equivalent 271 (17%)
Some Higher Education 331 (21%)
Prior Income
12,000-19,999 377 (24%)
20,000 or More 528 (33%)
5999 or Less 245 (15%)
6000-11,999 313 (20%)
Not answered 130 (8.2%)
Prior Age
18-24 103 (6.5%)
25-34 289 (18%)
35-44 347 (22%)
45-54 344 (22%)
55-59 128 (8.0%)
60-64 118 (7.4%)
65+ 243 (15%)
Not Given 21 (1.3%)
Male 864 (54%)
North West 143 (9.0%)
Yorks 120 (7.5%)
West Midlands 129 (8.1%)
East Midlands 109 (6.8%)
East Anglia 49 (3.1%)
SW England 126 (7.9%)
SE England 272 (17%)
Greater London 115 (7.2%)
Wales 59 (3.7%)
Scotland 396 (25%)
Profession: Large Employer 242 (15%)
Profession: Small Employer 73 (4.6%)
Profession: Self Employed 628 (39%)
Profession: Employee 64 (4.0%)
Profession: Temporary Worker 480 (30%)
Profession: Junior 77 (4.8%)
1 n (%); Median (IQR)

Comparing proportions

The aim of Ladd and Lenz (2009) is to estimate the effect of “reading a switching newspaper” on respondents’ voting behaviour change between the 1992 and 1997 elections. In terms of our variables in the dataset, they aim to estimate vote_l_97 (Voted Labour in 1997) from vote_l_92 (Prior Labour vote in 1992) as moderated by tolabor (treatment: indicator of whether reading a switching newspaper). All three variables are binary indicator variables.

We have what is called panel data, consisting of measurements on the same individuals at two different time points (1992 and 1997), which allows us to think in causal terms. But the question can first be broken down into three smaller comparisons:

  • the average treatment effect of tolabor in a cross-sectional design (oblivious of prior vote)
  • a before/after comparison of the treated group, comparing their average vote in 1997 to their average vote in 1992
  • a differences-in-differences comparison of the average change over time in the treatment group with the average change over time in the control group.
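All four group means underlying these comparisons can also be computed in a single pipeline, which can serve as a check on the step-by-step calculations below (a sketch using {dplyr}):

```r
## Mean Labour vote in 1992 and 1997, by treatment status
news |>
  dplyr::group_by(tolabor) |>
  dplyr::summarise(mean_vote_92 = mean(vote_l_92),
                   mean_vote_97 = mean(vote_l_97))
```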

Overall mean vote

What is the overall proportion of those voting Labour in the 1997 elections in the sample?

mean(news$vote_l_97)
[1] 0.4507219

As with all proportions, this can be read as a percentage if multiplied by 100:

mean(news$vote_l_97) * 100
[1] 45.07219

Conditional mean vote

What about the proportion of Labour voters among those who read/not read papers that shifted their editorial support?

# We can first break down the dataset into two:
reader <- news |>  filter(tolabor == 1) 
not_reader <- news |>  filter(tolabor == 0) 

# Then calculate means within each:
mean_reader_97 <- mean(reader$vote_l_97)
mean_not_reader_97 <- mean(not_reader$vote_l_97)

# With the result:
mean_reader_97
mean_not_reader_97
[1] 0.5829384
[1] 0.4305355

Average Treatment Effect

The average treatment effect would be the difference between the two groups:

ATE <- mean_reader_97 - mean_not_reader_97

ATE
[1] 0.1524029

Before/After

mean_reader_92 <- mean(reader$vote_l_92)
mean_not_reader_92 <- mean(not_reader$vote_l_92)

BA_reader <- mean_reader_97 - mean_reader_92
BA_reader

BA_not_reader <- mean_not_reader_97 - mean_not_reader_92
BA_not_reader
[1] 0.1943128
[1] 0.1078148

Differences-in-Differences

DD <- BA_reader - BA_not_reader

DD
[1] 0.08649803
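The same differences-in-differences estimate can be recovered from a simple regression of the within-person change score on the treatment indicator; the I() wrapper lets us compute the difference inside the formula (a sketch, anticipating the regression approach of the next section):

```r
## DiD as a regression: the slope on tolabor equals the DD value above
DD_reg <- lm(I(vote_l_97 - vote_l_92) ~ tolabor, data = news)
coefficients(DD_reg)
```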

Questions

  • What have we learnt from these various comparisons of proportions?

Linear probability model

The comparisons of proportions that we calculated earlier rely on comparing average changes, so we can obtain the same results using ordinary least squares linear regression. For example, we can recover the overall mean vote for Labour in 1997 in the sample, and the average treatment effect we calculated earlier:

## just the mean of Labour vote in 1997
mean_l_97 <- lm(vote_l_97 ~ 1, data = news)
coefficients(mean_l_97)
(Intercept) 
  0.4507219 
## ATE
ATE_reg <- lm(vote_l_97 ~ tolabor, data = news)
coefficients(ATE_reg)
(Intercept)     tolabor 
  0.4305355   0.1524029 

This ATE_reg model is a first step towards fitting the model presented by the authors in Table 1A. They report there on results from a probit regression, but we can start by building a linear regression model that we are familiar with (i.e. we can fit a “linear probability model”):

m0_lpm <- lm(vote_l_97 ~ tolabor + vote_l_92, data = news)
summary(m0_lpm)

Call:
lm(formula = vote_l_97 ~ tolabor + vote_l_92, data = news)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9981 -0.2114 -0.2114  0.1095  0.7886 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.21137    0.01208  17.495  < 2e-16 ***
tolabor      0.10765    0.02800   3.844 0.000126 ***
vote_l_92    0.67910    0.02016  33.678  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3784 on 1590 degrees of freedom
Multiple R-squared:  0.4226,    Adjusted R-squared:  0.4219 
F-statistic: 581.9 on 2 and 1590 DF,  p-value: < 2.2e-16

Questions

What does this simple model tell us?

What about if we also include the interaction effect between the two predictors?

m0_lpm_int <- lm(vote_l_97 ~ tolabor * vote_l_92, data = news)
summary(m0_lpm_int)

Call:
lm(formula = vote_l_97 ~ tolabor * vote_l_92, data = news)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.92683 -0.20513 -0.20513  0.09641  0.79487 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.20513    0.01235  16.607  < 2e-16 ***
tolabor            0.15921    0.03549   4.486 7.77e-06 ***
vote_l_92          0.69846    0.02174  32.124  < 2e-16 ***
tolabor:vote_l_92 -0.13597    0.05763  -2.359   0.0184 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3779 on 1589 degrees of freedom
Multiple R-squared:  0.4247,    Adjusted R-squared:  0.4236 
F-statistic: 390.9 on 3 and 1589 DF,  p-value: < 2.2e-16

Question: How do we interpret the coefficients on this simple interaction model?

If the interpretation of the interaction model proves difficult (as it should), maybe it’s better to visualise the model by plotting the coefficients. We can do that using the sjPlot::plot_model() function:

sjPlot::plot_model(m0_lpm_int, type = "pred", terms = c("tolabor", "vote_l_92"))

To make the meaning of the plot even more straightforward for our interpretative purposes, we can make some changes to the y-axis to display vertical lines at the probability-levels that we got from the model coefficients, and label them with the name of the regression terms. Check the code below against the information from the model summary:

sjPlot::plot_model(m0_lpm_int, type = "pred", terms = c("tolabor", "vote_l_92")) + 
  scale_y_continuous(breaks = c(0.20513, 
                                0.20513 + 0.15921, 
                                0.20513 + 0.69846, 
                                0.20513 + 0.15921 + 0.69846-0.13597),
                     labels = c("Intercept",
                                "tolabor",
                                "vote_l_92",
                                "tolabor:vote_l_92"))

The model we are interested in reproducing is that from the first column of Table 1A, and that model does not include any interaction terms, only main effects. We will therefore leave out the interaction, but instead let’s include all the predictors (independent variables) that were included in the original article. Since all the variables in the dataset are included in that model, we can save ourselves from typing out all the additive variable names by using the . identifier as a shorthand for “all the variables in the dataset not already mentioned in the regression formula”. This can be combined with “-” to exclude variables that we do not want included. In our case, we want to regress the 1997 vote for Labour on all the other variables in the dataset:

m1_lpm <- lm(vote_l_97 ~ ., data = news)
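As a quick illustration of the “-” exclusion syntax (the excluded variable rsex92 is picked arbitrarily here, purely as an example; we won’t use this model):

```r
## Regress on all variables except rsex92
m_demo <- lm(vote_l_97 ~ . - rsex92, data = news)
```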

We will be comparing the results from this model with those from other models, so instead of checking the model with the summary() function, let’s use the modelsummary() function from the modelsummary package. The function takes a list() object as argument, and the list can include several models to be tabulated side-by-side. For example, let’s include in the summary table three of the models that we built so far: mean_l_97, m0_lpm and m1_lpm. We can either build the list() object first and pass the object to modelsummary(), or we can build the list as part of the call to the latter:

modelsummary::modelsummary(list(mean_l_97, m0_lpm, m1_lpm))
This is what the modelsummary table should look like:
 (1)   (2)   (3)
(Intercept) 0.451 0.211 0.485
(0.012) (0.012) (0.163)
tolabor 0.108 0.105
(0.028) (0.028)
vote_l_92 0.679 0.280
(0.020) (0.046)
conservative −0.003
(0.043)
labor 0.156
(0.042)
liberal 0.083
(0.044)
white −0.193
(0.063)
wkclass 0.041
(0.022)
parent_labor 0.032
(0.021)
f_ideo92 0.010
(0.080)
vote_c_92 −0.079
(0.048)
vote_lib_92 −0.076
(0.047)
labfel92 0.224
(0.054)
confel92 −0.059
(0.052)
know_3low 0.085
(0.030)
know_3medium 0.026
(0.022)
TVnewseither 0.024
(0.021)
read_paper −0.030
(0.021)
ideo92 0.134
(0.105)
auth92 −0.002
(0.077)
tusa92 0.007
(0.023)
copemg92 −0.051
(0.019)
hedqul92College Degree −0.006
(0.034)
hedqul92Less than O level (or foreign qual) −0.028
(0.042)
hedqul92O Level or Equivalent −0.003
(0.035)
hedqul92Some Higher Education −0.035
(0.034)
hhincq9220,000 or More −0.001
(0.026)
hhincq925999 or Less 0.001
(0.033)
hhincq926000-11,999 0.065
(0.029)
hhincq92Not answered 0.003
(0.039)
ragect9225-34 −0.103
(0.045)
ragect9235-44 −0.068
(0.045)
ragect9245-54 −0.102
(0.046)
ragect9255-59 −0.155
(0.053)
ragect9260-64 −0.082
(0.054)
ragect9265+ −0.113
(0.050)
ragect92Not Given −0.168
(0.088)
rsex92 −0.026
(0.022)
region2 0.020
(0.051)
region3 −0.029
(0.053)
region4 −0.043
(0.053)
region5 0.021
(0.054)
region6 0.069
(0.066)
region7 −0.077
(0.053)
region8 −0.029
(0.048)
region9 −0.027
(0.055)
region10 −0.025
(0.063)
region11 −0.028
(0.046)
occupation2 0.003
(0.077)
occupation3 −0.024
(0.086)
occupation4 0.007
(0.074)
occupation5 0.037
(0.084)
occupation6 0.002
(0.075)
occupation7 −0.110
(0.084)
Num.Obs. 1593 1593 1593
R2 0.000 0.423 0.512
R2 Adj. 0.000 0.422 0.495
AIC 2300.8 1429.8 1265.2
BIC 2311.6 1451.3 1560.8
Log.Lik. −1148.411 −710.907 −577.622
F 581.949 30.417
RMSE 0.50 0.38 0.35

Questions

  • How would you interpret the main variables of interest (reading a switching newspaper, and having voted Labour in 1992) in the context of this larger model?
  • Look at a few other explanatory variables and briefly interpret the meaning of their coefficients.

Logit and Probit

While a linear probability model can be a quick way to check relationships in the data, the linear model is not really appropriate for modelling binary outcomes. One reason is that predictions from a linear model can fall outside the 0-1 range, which is meaningless for a probability (we’ll look at what this means, visually, below).
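We can check this directly for a fitted linear model; for example, with the m1_lpm model from earlier, the range of fitted values may extend beyond the 0-1 interval:

```r
## Range of fitted "probabilities" from the linear probability model
range(fitted(m1_lpm))
```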

Two models are more appropriate in such contexts: logistic regression (or logit model) and probit models. The authors of the paper use a probit model, and we’ll fit both a probit and a logit so we can compare their results. In R we can fit these types of generalised linear models using the glm() function by specifying a family of distributions and the linear link to be used. Both logit and probit are part of the broader binomial family, and the default link function for this family in glm() is “logit”, so we don’t need to specify that information when fitting a logistic model:

# A "probit" model that reproduces Table 1A results
m1_probit <- glm(vote_l_97 ~ ., family = binomial(link = "probit"), data = news)

# A "logit" model alternative
m1_logit <- glm(vote_l_97 ~ ., family = binomial, data = news)

Before examining the models more closely, let’s visualise predictions from the linear probability, probit and logit models to demonstrate the prediction boundary problem posed by the linear probability model. The three graphs below plot the estimated probabilities of voting Labour in 1997 (vote_l_97) based on the respondents’ “leftist” ideological orientation in 1992 (ideo92), alongside the model-specific regression line (the reason for choosing the ideo92 variable is simply that it’s a numeric variable that is easy to visualise here). The code to produce the plots is not the focus here, but you can unfold the code below to check it if interested:

Code
# Extract predicted values as data-frames using the marginaleffects::predictions function
lpm_effects <- marginaleffects::predictions(m1_lpm)
logit_effects <- marginaleffects::predictions(m1_logit)
probit_effects <- marginaleffects::predictions(m1_probit)

# A package to combine plots together
library(patchwork)

# lpm
plot1 <- lpm_effects |> 
  ggplot(aes(
    y = estimate, 
    x = ideo92)) +
  geom_point(alpha = 0.1) +
  stat_smooth(method="lm", color="red", se=FALSE) +
  labs(
    x = "",
    y = "Estimated probability of voting Labour in 1997"
  ) +
  ylim(-0.25, 1.25) +
  ggtitle("Linear probability model")

# probit
plot2 <- probit_effects |> 
  ggplot(aes(
  y = estimate, 
  x = ideo92)) +
  geom_point(alpha = 0.1) +
  stat_smooth(method="glm", color="blue", linetype=1, se=FALSE,
                method.args = list(family=binomial)) +
  stat_smooth(method="lm", color="red", se=FALSE) +
    labs(
    x = "Leftist political ideology (in 1992)",
    y = ""
  ) +
  ylim(-0.25, 1.25) +
  ggtitle("Probit model")

# logit
plot3 <- logit_effects |> 
  ggplot(aes(
  y = estimate, 
  x = ideo92)) +
  geom_point(alpha = 0.1) +
  stat_smooth(method="glm", color="blue", linetype=1, se=FALSE,
                method.args = list(family=binomial)) +
  stat_smooth(method="lm", color="red", se=FALSE) +
    labs(
    x = "",
    y = ""
  ) +
  ylim(-0.25, 1.25) +
  ggtitle("Logit model")

# Combine the three plots
plot1 + plot2 + plot3

We can see from the plots that the prediction curves from the probit and logit models are nearly identical and are both constrained within the probability limits of 0 and 1, whereas predictions from the linear model can fall outside these limits, even though probabilities below 0 or above 1 are meaningless, so such predictions cannot be right. The graphs highlight well the difference between logit/probit and linear regression.

We can now compare our three models in more detail using the modelsummary::modelsummary() function. This time, we will first make a list() object in which we give more meaningful titles to our models, and add a further specification to the modelsummary command to request that the coefficients be renamed using their variable labels (which we conveniently already have in this dataset, as we imported it from Stata format):

models <- list("OLS" = m1_lpm,
               "Probit" = m1_probit,
               "Logit" = m1_logit)
modelsummary::modelsummary(models, coef_rename = TRUE)
This is what the table should look like:
OLS Probit Logit
(Intercept) 0.485 −0.209 −0.592
(0.163) (0.845) (1.585)
Treatment (read switching paper) 0.105 0.484 0.840
(0.028) (0.127) (0.225)
Prior Conservative Identification −0.003 0.039 0.076
(0.043) (0.182) (0.320)
Prior Labour Identification 0.156 0.569 1.016
(0.042) (0.178) (0.313)
Prior Liberal Identification 0.083 0.368 0.657
(0.044) (0.181) (0.316)
White −0.193 −0.835 −1.653
(0.063) (0.319) (0.567)
Working-Class 0.041 0.160 0.300
(0.022) (0.097) (0.173)
Parents Voted Labour 0.032 0.121 0.238
(0.021) (0.094) (0.168)
Prior Ideological Moderation 0.010 0.526 0.985
(0.080) (0.452) (0.873)
Prior Labour Vote 0.280 0.879 1.496
(0.046) (0.188) (0.330)
Prior Conservative Vote −0.079 −0.251 −0.436
(0.048) (0.202) (0.353)
Prior Liberal Vote −0.076 −0.273 −0.502
(0.047) (0.190) (0.329)
Prior Labour Party Support 0.224 0.878 1.612
(0.054) (0.238) (0.423)
Prior Conservative Party Support −0.059 −0.334 −0.455
(0.052) (0.233) (0.417)
Prior Political Knowledge [low] 0.085 0.367 0.710
(0.030) (0.136) (0.244)
Prior Political Knowledge [medium] 0.026 0.112 0.233
(0.022) (0.101) (0.183)
Prior Television Viewer 0.024 0.141 0.266
(0.021) (0.096) (0.174)
Prior Daily Newspaper Reader −0.030 −0.178 −0.314
(0.021) (0.096) (0.173)
Prior Ideology 0.134 1.076 2.044
(0.105) (0.599) (1.170)
Authoritarianism −0.002 −0.157 −0.077
(0.077) (0.360) (0.647)
Prior Trade Union Member 0.007 0.051 0.088
(0.023) (0.104) (0.188)
Prior Coping Mortgage −0.051 −0.202 −0.364
(0.019) (0.087) (0.154)
Prior Education [College Degree] −0.006 −0.016 0.038
(0.034) (0.157) (0.282)
Prior Education [Less than O level (or foreign qual)] −0.028 −0.093 −0.119
(0.042) (0.194) (0.348)
Prior Education [O Level or Equivalent] −0.003 0.019 0.049
(0.035) (0.160) (0.286)
Prior Education [Some Higher Education] −0.035 −0.133 −0.233
(0.034) (0.155) (0.279)
Prior Income [20,000 or More] −0.001 0.019 0.019
(0.026) (0.117) (0.211)
Prior Income [5999 or Less] 0.001 0.065 0.041
(0.033) (0.153) (0.277)
Prior Income [6000-11,999] 0.065 0.356 0.611
(0.029) (0.132) (0.235)
Prior Income [Not answered] 0.003 0.007 0.072
(0.039) (0.177) (0.315)
Prior Age [25-34] −0.103 −0.416 −0.701
(0.045) (0.199) (0.349)
Prior Age [35-44] −0.068 −0.274 −0.456
(0.045) (0.198) (0.348)
Prior Age [45-54] −0.102 −0.424 −0.764
(0.046) (0.206) (0.364)
Prior Age [55-59] −0.155 −0.743 −1.283
(0.053) (0.243) (0.435)
Prior Age [60-64] −0.082 −0.351 −0.505
(0.054) (0.247) (0.438)
Prior Age [65+] −0.113 −0.544 −0.904
(0.050) (0.231) (0.410)
Prior Age [Not Given] −0.168 −0.718 −1.273
(0.088) (0.422) (0.779)
Male −0.026 −0.130 −0.253
(0.022) (0.101) (0.183)
North West 0.020 0.054 0.072
(0.051) (0.250) (0.451)
Yorks −0.029 −0.147 −0.141
(0.053) (0.260) (0.472)
West Midlands −0.043 −0.215 −0.359
(0.053) (0.255) (0.463)
East Midlands 0.021 0.057 0.063
(0.054) (0.260) (0.470)
East Anglia 0.069 0.305 0.498
(0.066) (0.310) (0.553)
SW England −0.077 −0.390 −0.712
(0.053) (0.260) (0.475)
SE England −0.029 −0.153 −0.226
(0.048) (0.234) (0.426)
Greater London −0.027 −0.194 −0.260
(0.055) (0.269) (0.488)
Wales −0.025 −0.239 −0.308
(0.063) (0.299) (0.543)
Scotland −0.028 −0.177 −0.267
(0.046) (0.227) (0.412)
Profession: Large Employer 0.003 −0.078 −0.079
(0.077) (0.368) (0.655)
Profession: Small Employer −0.024 −0.209 −0.388
(0.086) (0.413) (0.742)
Profession: Self Employed 0.007 −0.050 −0.082
(0.074) (0.352) (0.626)
Profession: Employee 0.037 0.066 0.143
(0.084) (0.395) (0.699)
Profession: Temporary Worker 0.002 −0.066 −0.100
(0.075) (0.357) (0.634)
Profession: Junior −0.110 −0.603 −1.022
(0.084) (0.403) (0.723)
Num.Obs. 1593 1593 1593
R2 0.512
R2 Adj. 0.495
AIC 1265.2 1317.3 1315.6
BIC 1560.8 1607.5 1605.8
Log.Lik. −577.622 −604.647 −603.825
F 30.417 11.497 8.800
RMSE 0.35 0.34 0.34
Questions

  1. Compare the coefficients from our probit model to those presented in the first column of Table 1A in the article.
  2. How do our three models compare? There are some easy rules of thumb for roughly converting coefficient values between the linear, logit and probit models. Try the following:
  • Multiply the probit coefficients by 1.6 to get an approximation of the logit coefficients
  • Divide the logit coefficients by 4 to get an approximation of the linear coefficients
  • Multiply the probit coefficients by 0.4 to get an approximation of the linear coefficients

These rules are not very precise in the case of this large model, but try them out on results from smaller models too.
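One way to try these conversions out in code, using the treatment coefficient from the three models fitted above (multiplying a probit coefficient by roughly 1.6 approximates the logit coefficient, and by roughly 0.4 the linear one):

```r
## Rough rule-of-thumb conversions on the `tolabor` coefficient
b_lpm    <- coef(m1_lpm)["tolabor"]
b_probit <- coef(m1_probit)["tolabor"]
b_logit  <- coef(m1_logit)["tolabor"]

b_probit * 1.6   # compare to the logit coefficient
b_logit / 4      # compare to the linear coefficient
b_probit * 0.4   # compare to the linear coefficient
```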

Exercise 2: Predicting the Frequency of Social Interaction using Poisson

For this exercise, we will use data from Weiss et al. (2021) to reproduce the results they present in the first column of their Supplementary Table S6. The article can be accessed online here; the table of interest is on page 78, and a brief description of the results is offered on page 32:

… we explored the possibility that trait differences may be associated with the frequency of social interactions … reported during the ESM [experience-sampling method] period. To this end, we conducted … Poisson regression analyses, controlling for overall participant response rate. As Supplementary Table S6 shows, only few measures predicted such possible selection effects: the overall number of social interactions was reliably predicted only by zero-sum beliefs, such that individuals holding more negative views regarding the antagonistic nature of social interactions and relationships reported a lower number of overall interactions.

Read the article to get a better understanding of the data. For our immediate purposes, it’s enough to know that we want to regress a measurement of the number of social interactions on a set of demographic and dispositional predictors.

We can load the dataset called EverydayTrust.Rds from the course website. Note that when reading a data file from an internet address with the readRDS() function, we need to wrap the address in the url() function. For simplicity, we'll abbreviate the name of the data object to et.

et <- readRDS(url("https://cgmoreh.github.io/HSS8005-data/EverydayTrust.Rds"))

Have a look at the dataset. Note that variables with the suffix “_c” are grand-mean-centred versions of the base variables. The Gender variable has a dummy-coded version and a so-called simple effect-coded version (Gender_e, coded as -1 for Female and 1 for Male), with which the intercept estimates the grand mean rather than the mean of the group coded 0. The authors use this latter version when fitting the model. The variable names are similar to the labels that appear in Table S6, so they are easy to identify.
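The effect of the two coding schemes on the intercept can be seen in a toy example. The values below are hypothetical (not from the et dataset): with a -1/1 effect-coded binary predictor, the intercept is the unweighted average of the two group means, whereas with 0/1 dummy coding it is the mean of the reference group:

```r
# Toy data: Female mean = 3, Male mean = 6 (hypothetical values)
gender <- c("Female", "Female", "Male")
y      <- c(2, 4, 6)

dummy  <- ifelse(gender == "Male", 1, 0)      # 0/1 dummy coding
effect <- ifelse(gender == "Male", 1, -1)     # -1/1 simple effect coding

coef(lm(y ~ dummy))["(Intercept)"]            # 3: the Female group mean
coef(lm(y ~ effect))["(Intercept)"]           # 4.5: average of the group means
```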

  1. Fit the Poisson model using the variables shown in Table S6 using the formula skeleton below:
... <- glm(... ~ ..., family="poisson", data = ...)
  2. Check the model summary and compare the results with those in Table S6.

Call:
glm(formula = Number_of_Social_Interactions ~ Gender_e + Age_c + 
    Political_Orientation_c + Religiosity_c + Trust_Scale_c + 
    Distrust_c + Moral_Identity_c + Zero_Sum_Belief_c + Social_Value_Orienation_c + 
    Signal_Response_Rate, family = "poisson", data = et)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2050  -0.7750  -0.0146   0.7306   3.0191  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                1.017743   0.075622  13.458  < 2e-16 ***
Gender_e                  -0.024457   0.017591  -1.390 0.164427    
Age_c                      0.002571   0.001550   1.659 0.097146 .  
Political_Orientation_c   -0.004787   0.008308  -0.576 0.564502    
Religiosity_c              0.014473   0.017348   0.834 0.404112    
Trust_Scale_c              0.016144   0.021150   0.763 0.445264    
Distrust_c                -0.037711   0.019505  -1.933 0.053187 .  
Moral_Identity_c          -0.020356   0.017854  -1.140 0.254216    
Zero_Sum_Belief_c         -0.068133   0.018167  -3.750 0.000177 ***
Social_Value_Orienation_c  0.002162   0.001484   1.456 0.145263    
Signal_Response_Rate       0.068348   0.003474  19.675  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1144.66  on 397  degrees of freedom
Residual deviance:  602.97  on 387  degrees of freedom
  (29 observations deleted due to missingness)
AIC: 2272.9

Number of Fisher Scoring iterations: 4
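When interpreting the summary, remember that Poisson coefficients are on the log-count scale, so exponentiating them gives multiplicative effects on the expected count (incidence-rate ratios). For example, taking the Zero_Sum_Belief_c estimate from the output above:

```r
# Exponentiate the log-scale coefficient to get an incidence-rate ratio
exp(-0.068133)
```

This is roughly 0.934, i.e. each one-unit increase in (centred) zero-sum belief is associated with about 6.6% fewer reported social interactions, holding the other predictors constant.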

References

Ladd, Jonathan McDonald, and Gabriel S. Lenz. 2009. “Exploiting a Rare Communication Shift to Document the Persuasive Power of the News Media.” American Journal of Political Science 53 (2): 394–410. https://doi.org/10.1111/j.1540-5907.2009.00377.x.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.
Weiss, Alexa, Corinna Michels, Pascal Burgmer, Thomas Mussweiler, Axel Ockenfels, and Wilhelm Hofmann. 2021. “Trust in Everyday Life.” Journal of Personality and Social Psychology 121: 95–114. https://doi.org/10.1037/pspi0000334.